Skip to content

ci(bench-gpu): harden teardown, cap pairs at 32, fix CUDA comment#736

Merged
JuArce merged 1 commit into
gpu_benchmarksfrom
gpu-bench-review-fixes
Jun 29, 2026
Merged

ci(bench-gpu): harden teardown, cap pairs at 32, fix CUDA comment#736
JuArce merged 1 commit into
gpu_benchmarksfrom
gpu-bench-review-fixes

Conversation

@MauroToscano

Copy link
Copy Markdown
Contributor

Review follow-ups for #724, targeting its branch so they fold into that PR. Only .github/workflows/benchmark-gpu.yml changes.

Correctness / cost

  • Teardown can no longer leak a paid box. The instance id is recorded only after vastai create instance returns and its JSON parses, but teardown was gated solely on that id file. A box created in that window — a concurrency cancel (cancel-in-progress: true) landing right after create, or a --raw schema change breaking the parse — would exist, bill indefinitely, and teardown would print "nothing to destroy". Teardown now falls back to destroying by the unique RUN_LABEL (gpu-bench-<run_id>-<run_attempt>) when no id is recorded, tolerating both possible vastai show instances --raw shapes and only ever matching this run's own box.
  • Pairs capped at 32 (was 40), odd counts rounded up to even, timeout raised to 210 min. At 40 pairs the worst case (80 proves + slow provisioning + dual CUDA build) could exceed the old 180-min timeout and die after the expensive build — paying for the box but producing no result. 32 keeps the worst case within budget; rounding odd→even honours the AB/BA balanced design (the "even is ideal" comment was previously unenforced).

Docs

  • Fix the CUDARC_PIN comment. It claimed "the boxes are CUDA 13.0", which contradicts the actual pin cuda-12080 (CUDA 12.8) and the cuda_max_good>=12.8 offer floor. Corrected to ~12.8, and tied to the MIN_DRIVER>=580 guard as the opposite (too-old vs too-new) end of the same driver-compatibility window. The pin value itself was already correct — only the explanatory comment was wrong.

Hardening (defense-in-depth)

  • Don't dump the full create.json. Replaced cat create.json with a jq projection of only the needed fields, so an unexpected sensitive field in the --raw response can't land in the (collaborator-/world-readable) run log.
  • Validate the workflow_dispatch branch name against the git-ref-safe charset before it is interpolated into the remote bash -lc command. workflow_dispatch is write-access only, so this isn't a privilege gain — just avoids feeding an unvalidated ref into a remote shell.
  • Surface workflow_dispatch failures. The run-summary write lived at the end of the bench step, so it was skipped whenever the bench failed (set -e/pipefail) — and dispatch runs have no PR comment step, so a failed dispatch showed nothing but the raw log. Moved it to an always() step that prints the result on success and the log tail on failure.

Deliberately left as-is per discussion: template name (NVIDIA CUDA Lambda VM 64GB is correct), building/running PR code as root on the rented box (bounded — box holds no secrets, destroyed after, collaborator-gated), no global cost cap (cost acceptable given trusted-collaborator gating), and issues: write (matches bench-abba.yml).

Validated with actionlint (clean) and unit-tested the clamp/round, branch validation, and teardown jq filter logic.

Note: #724's description still says pairs clamp to [2,40] — worth updating to [2,32] there.

Review follow-ups on the GPU benchmark workflow:

- Teardown: fall back to destroying by the unique RUN_LABEL when no instance
  id was recorded. The id file is written only after `create` succeeds and its
  JSON parses, so a box created in that window (concurrency cancel, or a parse
  failure) could otherwise leak and bill indefinitely.
- Cap pairs at 32 (was 40) and round odd requests up to even (the AB/BA design
  wants even N); raise the job timeout to 210 min so a worst-case 32-pair run
  (64 proves + slow provisioning + dual CUDA build) fits without timing out
  after the expensive build.
- Fix the CUDARC_PIN comment: the boxes are ~CUDA 12.8 (matching cuda-12080 and
  the cuda_max_good>=12.8 offer floor), not 13.0; tie it to the MIN_DRIVER guard
  as the opposite end of the same compatibility window.
- Log only the needed fields of create.json instead of the full --raw response,
  so an unexpected sensitive field can't land in the run log.
- Validate the workflow_dispatch branch name before it is interpolated into the
  remote `bash -lc` command.
- Move the run-summary write into an always() step so workflow_dispatch failures
  are visible in the Actions summary rather than only the raw step log.
@JuArce JuArce merged commit 80b6963 into gpu_benchmarks Jun 29, 2026
11 checks passed
@JuArce JuArce deleted the gpu-bench-review-fixes branch June 29, 2026 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants